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Abstract 

One challenge in applying bioinformatic tools to clinical or biological data is high number of features that might 
be provided to the learning algorithm without any prior knowledge on which ones should be used. In such 
applications, the number of features can drastically exceed the number of training instances which is often limited 
by the number of available samples for the study. The Lasso is one of many regularization methods that have 
been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this 
paper, we propose a novel algorithm for feature selection based on the Lasso and our hypothesis is that defining a 
scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. 
Our approach is to generate several samples from the training data by bootstrapping, determine the best 
relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly 
relevant features. In addition to the theoretical analysis of our feature scoring scheme, we provided empirical 
evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data 
analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a 
lymphoma dataset, and according to a human expert, our method led to selecting more meaningful features than 
those commonly used in the clinics. This case study built a basis for discovering interesting new criteria for 
lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code that 
implements our algorithm was released as FeaLect, a documented R package in CRAN. 



Introduction 

To build a robust classifier, the number of training 
instances is usually required to be more than the number 
of features. In many real life applications such as bioinfor- 
matics, natural language processing, and computer vision, 
a high number of features might be provided to the learn- 
ing algorithm without any prior knowledge about which 
ones should be used. Therefore, the number of features 
can drastically exceed the number of training instances 
and the model is subject to overfit the training data. Many 
regularization methods have been developed to prevent 
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overfitting and to improve the generalization error bound 
of the predictor in this learning situation. 

Most notably, Lasso [1] is an ^-regularization technique 
for linear regression which has attracted much attention 
in machine learning and statistics. The same approach is 
useful in classification because any binary classification 
problem can be reduced to a regression problem by treat- 
ing the class labels as real numbers, and consider the sign 
of the model prediction as the class label. The features 
selected by the Lasso depends on the regularization para- 
meter, and the set of solutions for all values of this free 
parameter is provided by regularization path [2] . Although 
efficient algorithms exist for recovering the whole regulari- 
zation path for the Lasso [3], finding a subset of highly 
relevant features which leads to a robust predictor is a 
prominent research question. 
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In this paper, we propose a novel algorithm for feature 
selection based on the Lasso and our hypothesis is that 
defining a scoring scheme that measures the "quality" of 
each feature can provide a more robust feature selection 
method. Our approach is to generate several samples 
from the training data by bootstrapping, determine the 
best relevance-ordering of the features for each sample, 
and finally combine these relevance-orderings to select 
highly relevant features. In addition to the theoretical 
analysis of our feature scoring scheme, we provided 
empirical evaluations using a real-life lymphoma dataset 
as well as several UCI datasets, which confirms the super- 
iority of our method in exploratory data analysis and pre- 
diction performance. 

Background and previous work 

Lasso is an ^-regularization technique for least-square 
linear regression: 

" 1 2 
t=l 

where the response random variable Fe lis dependent 
on a ^-dimensional covariate X e R d , and the training 
data D = {(x;,y;)}" =1 is independently and identically 
sampled from a fixed joint distribution P XY . It is well 
known that the ^-regularization term shrinks many com- 
ponents of the solution to zero, and thus performs feature 
selection [4] . There has been also some variants, such as 
elastic nets [5], to select highly-correlated predictive fea- 
tures. The number of selected features in eqn (1) is con- 
trolled by the regularization parameter X. 

A common practice is to find the best value for X by 
cross-validation to maximize the prediction accuracy. Hav- 
ing found the best value for the regularization parameter, 
the features are selected based on the non-zero compo- 
nents of the global and unique minimizer of the training 
objective in equation (1). However, recent research on the 
consistency of the Lasso [4,6-10] shows that a fixed value 
of X for all n will not result in a consistent estimate for the 
parameter vector [7]. Now, the question is what would be 
a proper value for X as a function of n with a theoretical 
basis? 

Various decaying schemes of the regularization para- 
meter were studied [4,7,8,11] and it is shown that under 
specific settings, Lasso selects the relevant features with 
probability one and the irrelevant features with a positive 
probability less than one, provided that the number of 
training instances tends to infinity. To do a better feature 
selection, note that each run of the cross-validation gives 
the value of the regularization parameter X and the corre- 
sponding selected-features. If several samples were avail- 
able from the underlying data distribution, irrelevant 
features could be removed by simply intersecting the set of 



selected features for each sample. The idea in [7] is to pro- 
vide such datasets by resampling with replacement from 
the given training dataset using the bootstrap method [12]. 
This approach leads to Bolasso algorithm for feature selec- 
tion that is theoretically motivated by the proposition 1. 
Proposition 1. [7]Suppose P X y satisfies some mild 

assumptions and let ^ _ \ for a fixed constant 

fio>0. Let J represents the index of the true relevant fea- 
tures, and J denote the index of relevant features found 
by Bolasso. Then, the probability that Bolasso does not 
select the correct model is upper-bounded by: 

a \ An logn logm 

Pr(7 ^J) < mA ie - Ain +A 3 — 2-+A 4 _5— , 
V / A m 

no- 
where m >1 is the number of bootstrap samples, and 
all Ai s are positive constants. 

Now, if we send m to infinity slower than e Al " < then 
with probability tending to one Bolasso will select J, 
exactly the relevant features. The proposition 1 guarantees 
the performance of Bolasso only asymptoticly, i.e. when 
n — > oo. However, in real applications where the number 
of training samples is often limited, the probability of 
selecting relevant features can be significantly less than 1. 
One of the main goals of our proposed framework in this 
paper is to address this problem by scoring the features. 

Previous studies have shown that there is room for 
improving Bolasso [7,8] . For example, while on synthetic 
data it outperforms similar methods such as ridge regres- 
sion, Lasso, and bagging of Lasso estimates [13], Bolasso is 
sometimes too strict on real data because it requires the 
relevant features to be selected in all bootstrap runs. 
Bolasso-S, a soft version of Bolasso, performs better in 
practice because it relaxes this condition and selects a fea- 
ture if it is chosen in at least a user-defined fraction of the 
bootstrap replicates (a threshold of 90% is considered to 
be enough). Bolasso-S is more flexible and thus, more 
appropriate for the practical models that are not extremely 
sparse [8]. 

Our contributions 

In this paper, we develop FeaLect algorithm that is 
softer than Bolasso in the following three directions: 

♦ For each bootstrap sample, Bolasso considers only 
one model that minimizes the training objective C 
in eqn (1), whereas we include information provided 
by the whole regularization path, 

♦ Instead of making a binary decision of inclusion or 
exclusion, we compute a score value for each feature 
that can help the user to select the more relevant ones, 

♦ While Bolasso-S relies on a threshold, our theoreti- 
cal study of the behaviour of irrelevant features leads 
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to an analytical criterion for feature selection with- 
out using any pre-defined parameter. 

We compared the performance of Bolasso, FeaLect, and 
Lars algorithms for feature selection on six real datasets in 
a systematic manner. The source code that implements 
our algorithm was released as FeaLect, a documented 
R package in CRAN. 

Feature scoring and mathematical analysis 

In this section, we describe our novel algorithm that scores 
the features based on their performance on samples 
obtained by bootstapping. Afterwards, we present the 
mathematical analysis of our algorithm which builds the 
theoretical basis for its proposed automatic thresholding 
in feature selection. 

The FeaLect algorithm 

Our feature selection algorithm is outlined in Figure 1 and 
described in Algorithm 1. Let B be a random sample with 
size yn generated by choosing from the given training data 
D without replacement, where n = \D\ and ye (0, 1) is a 
parameter that controls the size of sample sets. Using a 
training set B, we apply the Lars algorithm to recover the 
whole regularization path efficiently [3]. Let F B be the set 
of selected features by the Lasso when X allows exactly 
k features to be selected. The number of selected features 
is decreasing in X and we have: 

0 = F B C...F B CF B +1 C---CF B =F. 



For each feature f, we define a scoring scheme 
depending on whether or not it is selected in F B : 

6fe(/J ' jo otherwise ( > 

The above randomized procedure is repeated several 
times for various random subsets B to compute the aver- 
age score of /when exactly k features are selected, i.e. 
Eb[S^(/)] is estimated empirically. According to our 
experiments, the convergence rate to the expected score is 
fast and there is no significant difference between the aver- 
age scores computed by 100 or 1000 samples (Figure 2). 
The total score for each feature is then defined as the sum 
of average scores: 

S{f):=J2^[S B k {f)] (3) 

k 

Algorithm 1 Feature Scoring 
1: for t = 1 to m do 

2: Sample (without replacement) a random subset 
BCD with size y\D\ 
3: Run Lars on B to obtain F B , ...,F B 
4: Compute S B , ...,S B using eqn (2) 
5: for k e {1, d) do 

6: Update the feature scores for all feature 

f:S(f^S(f) + S B (f)/m 
7: end for 
8: end for 
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Figure 1 Overview of bootstraping performed by FeaLect. A row and a column of the gray data matrix correspond to a feature and a case, 
accordingly. 1000 models are trained, each fitted to a random subset that contains | of cases using Lasso technigue [1], Without any 
assumption from a-priori knowledge, all features are included for training the models. Then the selected features are scored by computing an 
average vote (eg. 3) to select the most predictive ones. 
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9: Fit a 3-segment spline g 2 (.), ^(-W on log- 

scale feature score curve (see the text for more 
information) 

10: return features corresponding to g 3 as informa- 
tive features 

Before describing the rest of the algorithm, let us have 
a look at the feature scores for our lymphoma classifica- 
tion problem (The task and data set is described in 
details in the Experiment section). Figure 2 depict the 
total score of features in log-scale, where features are 
sorted according to their increasing total scores. The 
feature score curve is almost linear in the middle and 
bending at both ends. We hypothesize that features with 
a "very high score" in the top non-linear and bending 
part of the curve are good candidates for informative 
features. Furthermore, the linear middle-part of the 
curve consists of features that are responsible for the 
model to get overfitted and therefore we call them irre- 
levant features. In the next section, a formal definition 
will be provided to clarify this intuitive idea and we 
show how this insight can be very helpful in identifying 
informative features. 

The final step of our feature selection algorithm is to 
fit a 3-segment spline model to the feature score curve: 
the first quadratic lower-part captures the low score fea- 
tures, the linear middle-part captures irrelevant features, 
and the last quadratic upper-part captures high-score 
informative features. As discussed below, the middle lin- 
ear-part provides an analytic threshold for the score of 
relevant features: The features with score above this 
threshold are reported as informative features which can 
be used for training the final predictor and/or explana- 
tory data analysis. 



The analysis 

The aim of this analysis is to provide a mathematical 
explanation for the linearity of the middle part of the 
scoring function (Figure 2), and also a justification for 
why the features corresponding to this part can be 
excluded. We first present a probabilistic interpretation 
of the feature scores, and then we provide a precise defi- 
nition of an irrelevant feature. Our definition formalizes 
the fact that such a feature is selected by the Lasso if 
and only if a particular fixed finite subset U of instances 
is included in the random training set, whereas a rele- 
vant feature should be selected for almost any general 
U. We estimate the probability that a random sample 
B c D contains U as n grows to infinity. Finally, we 
show that asymptotically, the log of the scores for irrele- 
vant features is linear in \U\. This explains the linearity 
of the middle part of the feature score curve in Figure 2. 

Proposition 2. Suppose Pr (f = f) is the probability of 
selecting a feature f by the Lasso in some stage of our 
feature selection method in Algorithm 1. Then, the prob- 
ability distribution of the random variable f is given by: 

Proof. By conditional probability: 

Pr(f = /0 =£ Pr(f = /;|B)Pr(B) 
= EE OMWilfeF*) 

B k-1 

Pr(feF«)Pr(B)) 

d 

= EE S^)Pr(feF«)Pr(B) 

B k-1 
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Since we have not imposed any prior assumption, we 
put a uniform distribution on Pr(f ef') to get: 

Pr(f = /0 = z££s£(/i)Pr(B) 

B k 

= ^£^c/o) 

= \m. 

□ 

The following definition formalizes the idea that irrele- 
vant features depend only on a specific subset of the 
whole data set. 

Definition 3. For any subset of samples U £ A and 
any feature fi we say that f over-fits on U if: 

Vfc,VB :/,eF^UcB 

In words, f is selected in if and only if B contains U. 
Next, we derive the probability of including a specific set 
U in a randomly generated sample. 

Lemma 4. For any U £ A, we have: 

lim Pr(l7 c B) = y r 

n^oo B 

where r is the number of samples in U and y is the frac- 
tion of samples chosen for a random set B. 

Proof. Assuming B has jn members chosen without 
replacement, we have: 



_ (n-r)!(yn)! 
n\{yn—r} ! 




r-1 

= Y[[{yn-i){n-iy l ] 

1=0 



=K r fi( i+ai ^) 

= ^(1+0(0). 

The first line of the above proof relies on the assump- 
tion that the members of the random set B are chosen 



without replacement, and the claim derives from the 
fact that 7 is a fixed constant. D 

The following theorem concludes our argument for 
the exponential behavior of total score of irrelevant fea- 
tures. It relates the probability of selecting a feature^ 
irrelevant on U to the probability of including U in the 
sample. 

Theorem 5. If a feature f over-fits on a set of samples 
U with size r, then: 

lim S(fi) = dy r . 

n— >oo 

Proof. From proposition 2 we have: 
S(f i )=dVx{i = f l ) 

= d^Pr(/ieF B )Pr(B)^Pr 

B 

= dPr(L/cB) 
= d{ Y r + 0(n- 1 )). 

The last equation was proved in lemma 4, and the one 
before that from definition 3. □ 

Although we presented the above arguments for the 
Lasso, it also should work for any other feature selection 
algorithm which exhibits linearity in its feature score 
curve. That is, features corresponding to the linear part 
of the scoring curve are indeed the irrelevant features for 
that algorithm, and therefor, the features on non-linear 
upper-part should be considered as informative ones. 
Obviously the features on the non-linear lower-part are 
not interesting for the any prediction task because their 
scores are even less than the irrelevant features. We spec- 
ulate that these features do not present a linear behavior 
because not only they are not relevant to the outcome, 
but also they are not associated with any particular set U, 
meaning they are not even included in an over-fitted 
model. A follow-up study may investigate this hypothesis 
further. 

Experiment with real data 

We applied FeaLect on several datasets to test the per- 
formance of our feature selection algorithm in real life 
conditions. 

Lymphoma 

Lymphoma is a cancer that begins in the lymphatic cells of 
the immune system, and is presented as a solid tumor of 
lymphoid cells [14]. Just as cancer represents many differ- 
ent diseases, lymphoma represents many different cancers 
of lymphocytes [15]. We applied our algorithm for auto- 
matic diagnosis of lymphoma types based on flow cytome- 
try (FCM) data [16]. Usually 15-30 markers are used for 
each patient, where each marker distinguishes a particular 
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cell type based on its protein content. We analyzed flow 
cytometry data of 85 lymphoma patients who had been 
diagnosed at BC Cancer Agency, Vancouver, Canada 
between 2004-2007. The patients were grouped into four 
top-level disease subgroups and the goal was to build a 
classifier that could diagnosis 20 test patients based on 
their FCM data. For each group, we trained a classifier 
to distinguish that group versus the others. These four 
classifiers were then combined to provide the top-level 
diagnosis. 

Data preparation and feature extraction 

The blood sample of each patient was divided into 7 
portions, and each portion is examined in a different 
tube by the cytometer. Each tube gives 5 dimensional 
data of 20,000-70,000 blood cells. In the first analysis 
step, we used a spectral clustering approach to cluster 
the cells in each tube into cell populations. It was not 
possible to directly apply classical spectral clustering 
[17-20] to the lymphoma data because it involved com- 
puting eigenvectors of a big «-by-« matrix where n 
ranges from 20,000 to 70,000. Instead, we have made 
use of SamSPECTRAL that is our enhanced spectral 
clustering method capable of analyzing large amount of 
data in a reasonable amount of time; it has also a good 
memory footprints [21]. 

SamSPECTRAL performs a specific sampling stage 
called faithful sampling to reduce the size of data for 
spectral clustering. Our data reduction scheme is 
designed to preserve density information and can be 
briefly stated as follows: 

1. Set all points to be unregistered and assume the 
parameter h is adjusted appropriately. 

2. Pick a random unregistered point p (the represen- 
tative of a community) and find all unregistered data 
points within distance h from p. 

3. Put all of these points in a set called community 
p, and label them as registered. 

4. Repeat the above two steps until no unregistered 
points are left. 

After the above steps, the similarity between the com- 
munities is defined by summing up similarities between 
their members, and the resulting similarity matrix is 
passed to a classical spectral clustering algorithm. Because 
this matrix is much smaller than the original similarity 
matrix (3000-by-3000 instead of 20,000-by- 20,000 in our 
experiments), its eigenvectors can be efficiently computed 
in reasonable time. 

Each cluster computed by SamSPECTRAL was regarded 
as a "cell population" that could potentially have informa- 
tion about the lymphoma type. Without imposing any a 
priori knowledge on the importance of any population, we 
considered their sizes and their means in all dimensions as 



features. In total, 276 features were obtained and ignoring 
those with very low variance, 224 were kept for feature 
selection and classification. 
Feature selection and classification 

Since the number of features was considerably larger 
than the number of training samples (p = 224, n = 85), 
a careful feature selection scheme was needed. To 
reduce the computation time required, we imposed a 
pre-defined upper bound 60 on the number of features 
based on a priori knowledge from the biology. We initi- 
ally applied £ 1 -regularization technique, and it was not 
by its own enough to prevent overfitting. Reducing the 
regularization parameter did not improve the results as 
we observed that some of the features that were known 
to be biologically and clinically interesting were ignored. 
We also applied Bolasso [7] to select relevant features. 
For most bootstrap samples of our data, the global error 
defined by equation (1) was minimized when only a few 
(less than 4) features were selected. Because the inter- 
section of selected features from several (more than 10) 
samples was empty, Bolasso could not result in appro- 
priate feature selection. 

Next, we applied our feature selection algorithm. In our 
experiment, we set y = | to be the fraction of instances 
used in each iteration for training. Figures 2(a) and 2(b) 
depict the resulting feature scores for follicular lym- 
phoma type after m = 1000 random runs. The log-scale 
plot consisted of a linear part that confirmed our hypoth- 
esis experimentally. Similarly, the plots for other types of 
lymphoma had also linear parts. Furthermore, we re-ran 
the experiments with m = 5000 random samples and the 
results did not vary significantly indicating a fast conver- 
gence rate for the feature scores. 

To select the informative features, we fitted a 3-segment 
spline model to each curve. The features corresponding to 
the middle linear segment were considered as irrelevant 
ones, and ignored for the rest of analysis. Features with 
score higher than score of these irrelevant features were 
selected as informative features. We observed that unlike 
the pure Lasso, all features that were known to be biologi- 
cally and clinically interesting were selected by our 
approach. Prediction accuracy was improved confirming 
the efficiency of our feature selection method. We used 
our selected features to build a linear classifier that had 
precision, recall and F-measure 98%, 94% and 96%, respec- 
tively while the best result we obtained with the pure 
Lasso was 93%, 82% and 87%, respectively. 

For further evaluation in a data explarotary setting, we 
interrogated the selected features together with our clinical 
collaborators for novel biomarker discovery. A task which 
would be challenging otherwise, due to large number of 
features and huge amount of clinical work required to 
evaluate each individual feature. We narrowed down 
our attention to those features which were relevant to 



Zare et at. BMC Genomics 2013, 14(Suppl 1):S14 
http://www.biomedcentral.com/1471-2164/14/S1/S14 



Page 7 of 9 



Table 1 Comparsion of area under the ROC curve between FeaLect, lars, and Bolasso on six different datasets. 

Dataset Total samples # of features 20 training samples 40 training samples Reference 









Bolasso 


lars 


FeaLect 


Bolasso 


lars 


FeaLect 




Lymphoma 


258 


505 


0.62 


0.81 


0.84 


0.67 


0.87 


0.88 


current 


Colon 


62 


2000 


0.50 


0.57 


0.65 


0.47 


0.64 


0.75 


[23] 


Arcene 


100 


10000 


0.51 


0.59 


0.64 


0.50 


0.66 


0.72 


[26] (UCI) 


SECOM 


208 


590 


0.51 


0.57 


0.61 


0.52 


0.61 


0.64 


[25] (UCI) 


Connectionist 


208 


60 


0.63 


0.76 


0.78 


0.67 


0.78 


0.79 


[27] (UCI) 


ISOLET 


479 


617 


0.90 


0.99 


1.00 


0.91 


1.00 


1.00 


[28] (UCI) 



lymphoma types based on our feature selection algorithm, 
but were not previously reported to be biologically rele- 
vant. This approach resulted in the discovery of interesting 
new criteria for lymphoma diagnosis that have clinical 
applications in practice [22], 

Additional real datasets 

In addition to our lymphoma flow cytometry data, we 
validated the performance of FeaLect on five other data- 
sets including the well-known colon gene expression 
(Table 1). Colon dataset contains expression of 2000 
genes in 22 normal and 40 colon cancer tissues [23] and 
it is a benchmark for gene expression analysis [24]. All 
additional four datasets are from UCI (University of Cali- 
fornia, Irvine) Machine Learning Repository [25]. Arcene 
contains mass-spectrometric data for cancer and normal 
cases [26], variables of SECOM were collected from sen- 
sors and process measurements in complex modern 
semi-conductors with the goal of enhancing current busi- 
ness improvement techniques [25]. We used a version of 
SECOM dataset balanced by randomly selecting equal 



number of positive and negative samples. The learning 
task for Connectionist dataset is to train a network to 
discriminate between sonar signals bounced off a metal 
cylinder and those bounced off a roughly cylindrical rock 
[27]. ISOLET is a natural language processing dataset 
generated from speech of 150 subjects with the goal of 
identifying which letter-name was spoken [28]. In the 
current study, we only considered letters A and B as posi- 
tive and negative samples and excluded the rest of 
samples. 

Table 1 compares the performance of Bolasso, pure 
lars, and FeaLect on the studied datasets. Training sam- 
ples were selected uniformly at random and area under 
the ROC curves (AUC) were computed using the rest of 
samples (Figure 3). For each dataset, we repeated this 
procedure 100 times and reported the average AUC to 
avoid any dependency on the random selection of train- 
test sets. Both FeaLect and lars always outperformed 
Bolasso. When only 20 random training samples were 
provided, FeaLect provides significantly better than pure 
lars except ISOLET dataset. The number of samples in 




500 1 000 

Number of selected features 
(a) Colon 




100 200 300 

Number of selected features 



400 



(b) Lymphoma 



Figure 3 Variation of area under the ROC curve when different number of features are used. The features are sorted by applying FeaLect 
on 20 random training samples. Then, the training samples and the highly scored features are considered to build linear classifiers by lars. The 
best AUC is reported by testing on a set of validating samples disjoint from the training set. For both lymphoma and colon datasets, the 
performance of the optimum classifier decreases if all features are provided to lars. This observation practically shows the advantage of using a 
imited number of highly scored features over pure lars. 
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(a) Colon (b) Lymphoma 

Figure 4 Comparing ROC curves between FeaLect and lars. The blue curve represents the ROC curve of the best Lasso model trained on 20 
random samples using all available features, and the red curve shows the performance of the best Lasso model when only 61 and 36 top 
features are provided from colon and lymphoma datasets respectively. While FeaLect always performs better than pure lars, the difference is 
more significant for colon dataset than lymphoma dataset. 



ISOLET dataset is more than other datasets, enough such 
that both methods performed well. The superiority of Fea- 
Lect over lars decreases as the number of training samples 
increases from 20 to 40, except for Colon and Arcene 
datasets for which FeaLect is still better than lars by .09 
and 0.6, accordingly. Interestingly, these two datasets are 
the ones with 2000 and 10000 features that are consider- 
ably higher dimensional than other datasets. This obser- 
vation reassures that FeaLect is advantageous over lars 
in high-dimensional settings and their performance 



converges as "adequate" number of samples are provided 
(Figures 4 and 5). 

Conclusion 

We have presented FeaLect, a novel feature selection 
algorithm, based on Lasso (Figure 1). The idea of FeaLect 
is to combine the selected feature sets to score the fea- 
tures according to their relevancy and prediction power. 
An advantage of FeaLect compared to many other feature 
selection methods is to provide a ranking for features 




10 ""iS 20 25 30 35 20 40 60 80 100 120 

Number of training samples Number of training samples 

(a) Colon (b) Lymphoma 



Figure 5 Improvements in the area under the ROC curves by increasing the number of training samples Except for Bolasso on colon 
dataset, the average performance increases as more training samples are provided. While FeaLect and lars converge to a common asymptotic 
performance on lymphoma dataset, FeaLect is consistently superior to pure lars on colon dataset because the number of training samples is 
very limited. Table 1 presents similar superiority for other datasets with relatively low instances. 
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relevance, which can be leveraged in better prediction 
models and/or exploratory data analysis. We reported a 
cancer classification problem (lymphoma diagnosis) for 
which distinguishing the most relevant features is of 
great interest from the biological and clinical point of 
view. FeaLect has led to the discovery of novel biomar- 
kers for this disease to help clinicians in lymphoma sub- 
type diagnosis [22]. The log-scale score curve can be 
studied in more detail and explaining its behavior in the 
non-linear parts is potentially a source of insight. Shed- 
ding more light on the Lasso performance by studying 
feature scores is a possible future direction of this study. 

Furthermore, we provided empirical and quantitative 
evaluations on five other real-world datasets (from differ- 
ent fields) to confirm the superiority of our method, in 
prediction performance, compared to the baselines. 
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